An index to quantify an individual's scientific research output 
I propose the index h, defined as the number of papers with citation number higher or equal to 
h, as a useful index to characterize the scientific output of a researcher. 



For the few scientists who earn a Nobel prize, the impact
and relevance of their research work are unquestionable.
Among the rest of us, how does one quantify the cumulative
impact and relevance of an individual's scientific research
output? In a world of limited resources such quantification
(even if potentially distasteful) is often needed for
evaluation and comparison purposes, e.g. for university
faculty recruitment and advancement, award of grants, etc.

The publication record of an individual and the citation
record are clearly data that contain useful information.
That information includes the number ($N_p$) of papers
published over n years, the number of citations ($N_c^j$)
for each paper (j), the journals where the papers were
published and their impact parameter, etc. This is a large
amount of information that will be evaluated with different
criteria by different people. Here I would like to propose
a single number, the "h-index", as a particularly simple
and useful way to characterize the scientific output of a
researcher.

A scientist has index h if h of his/her $N_p$ papers have
at least h citations each, and the other ($N_p - h$) papers
have no more than h citations each.
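
As an illustration of this definition, the following is a minimal Python sketch, not part of the original analysis. The function name `h_index` and the sample citation counts are hypothetical, chosen only to show the ranking step.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the rank-th paper still has at least `rank` citations
        else:
            break      # counts only decrease from here on
    return h


# Hypothetical record of seven papers (illustrative numbers only).
example = [25, 8, 5, 4, 3, 1, 0]
print(h_index(example))  # 4
```

Sorting once and scanning down the ranked list mirrors the manual procedure of ordering papers by times cited and reading off the last rank that does not exceed its citation count.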

The research reported here concentrated on physicists;
however, I suggest that the h-index should be useful for
other scientific disciplines as well. (At the end of the
paper I discuss some observations for the h-index in
biological sciences.) The highest h among physicists appears
to be E. Witten's, h = 110. That is, Witten has written
110 papers with at least 110 citations each. That gives
a lower bound on the total number of citations to Witten's
papers at $h^2 = 12{,}100$. Of course the total number of
citations ($N_{c,tot}$) will usually be much larger than
$h^2$, since $h^2$ both underestimates the total number of
citations of the h most cited papers and ignores the papers
with fewer than h citations. The relation between
$N_{c,tot}$ and h will depend on the detailed form of the
particular distribution [1, 2], and it is useful to define
the proportionality constant a as



$$N_{c,tot} = a\,h^2 \qquad (1)$$



I find empirically that a ranges between 3 and 5. 
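
To make Eq. (1) concrete, here is a small Python sketch that computes the coefficient a for a citation record. It is illustrative only: the function name `a_coefficient` and the sample counts are assumptions, not data from this study.

```python
def a_coefficient(citations):
    """Return a = N_c,tot / h**2 for a list of per-paper citation counts."""
    ranked = sorted(citations, reverse=True)
    # With counts sorted in decreasing order, the condition count >= rank
    # holds exactly for the first h ranks, so counting the hits gives h.
    h = sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)
    if h == 0:
        raise ValueError("a is undefined when h = 0")
    return sum(ranked) / h ** 2


# Hypothetical record: 46 citations in total and h = 4.
print(a_coefficient([25, 8, 5, 4, 3, 1, 0]))  # 2.875

# Lower bound on the total citations implied by h = 110.
print(110 ** 2)  # 12100
```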

Other prominent physicists with high h's are A.J. Heeger
(h = 107), M.L. Cohen (h = 94), A.C. Gossard (h = 94),
P.W. Anderson (h = 91), S. Weinberg (h = 88), M.E. Fisher
(h = 88), M. Cardona (h = 86), P.G. deGennes (h = 79),
J.N. Bahcall (h = 77), Z. Fisk (h = 75), D.J. Scalapino
(h = 75), G. Parisi (h = 73), S.G. Louie (h = 70),
R. Jackiw (h = 69), F. Wilczek (h = 68), C. Vafa (h = 66),
M.B. Maple (h = 66), D.J. Gross (h = 66), M.S. Dresselhaus
(h = 62), S.W. Hawking (h = 62).

I argue that h is preferable to other single-number cri- 
teria commonly used to evaluate scientific output of a 
researcher, as follows: 

(0) Total number of papers (Np): Advantage: mea- 
sures productivity. Disadvantage: does not measure im- 
portance nor impact of papers. 

(1) Total number of citations ($N_{c,tot}$): Advantage:
measures total impact. Disadvantage: hard to find; may
be inflated by a small number of 'big hits', which may
not be representative of the individual if he/she is
coauthor with many others on those papers. In such cases
the relation Eq. (1) will imply a very atypical value of a,
larger than 5. Another disadvantage is that $N_{c,tot}$
gives undue weight to highly cited review articles versus
original research contributions.

(2) Citations per paper, i.e. ratio of $N_{c,tot}$ to $N_p$:
Advantage: allows comparison of scientists of different ages.
Disadvantage: hard to find; rewards low productivity,
penalizes high productivity.

(3) Number of 'significant papers', defined as the num- 
ber of papers with more than y citations, for example 
y = 50. Advantage: eliminates the disadvantages of cri- 
teria (0), (1), (2), gives an idea of broad and sustained 
impact. Disadvantage: y is arbitrary and will randomly 
favor or disfavor individuals; y needs to be adjusted for 
different levels of seniority. 

(4) Number of citations to each of the q most cited
papers, for example q = 5. Advantage: overcomes many
of the disadvantages of the criteria above. Disadvantage:
it is not a single number, making it more difficult to
obtain and compare. Also, q is arbitrary and will randomly
favor and disfavor individuals.

Instead, the proposed h-index measures the broad impact
of an individual's work; it avoids all the disadvantages
of the criteria listed above; it usually can be found
very easily, by ordering papers by 'times cited' in the
Thomson ISI Web of Science database [3]; it gives a
ballpark estimate of the total number of citations, Eq. (1).

Thus I argue that two individuals with similar h are
comparable in terms of their overall scientific impact,
even if their total number of papers or their total number
of citations is very different. Conversely, between
two individuals (of the same scientific age) with similar
number of total papers or of total citation count and very
different h-values, the one with the higher h is likely to
be the more accomplished scientist.

For a given individual one expects that h should in- 
crease approximately linearly with time. In the simplest 
possible model, assume the researcher publishes p papers 
per year and each published paper earns c new citations 
per year every subsequent year. The total number of 
citations after n + 1 years is then 

$$N_{c,tot} = \sum_{j=1}^{n} p\,c\,j = \frac{p\,c\,n(n+1)}{2} \qquad (2)$$

Assuming all papers up to year y contribute to the index 
h we have 

$$(n - y)\,c = h \qquad (3a)$$

$$p\,y = h \qquad (3b)$$

FIG. 1: The intersection of the 45 degree line with the curve
giving the number of citations versus the paper number gives
h. The total number of citations is the area under the curve.
Assuming the second derivative is non-negative everywhere,
the minimum area is given by the distribution indicated by
the dotted line, yielding a = 2 in Eq. (1).

The left side of Eq. (3a) is the number of citations to the
most recent of the papers contributing to h; the left side
of Eq. (3b) is the total number of papers contributing to
h. Hence from Eq. (3),



$$h = \frac{c}{1 + c/p}\, n \qquad (4)$$



The total number of citations (for not too small n) is
then approximately

$$N_{c,tot} \approx \frac{(1 + c/p)^2}{2\,c/p}\, h^2 \qquad (5)$$

of the form Eq. (1). The coefficient a depends on the
number of papers and the number of citations per paper
earned per year as given by Eq. (5). As stated earlier we
find empirically that a ~ 3 to 5 are typical values. The
linear relation

$$h \sim m\,n \qquad (6)$$

should hold quite generally for scientists that produce
papers of similar quality at a steady rate over the course of
their careers; of course, m will vary widely among different
researchers. In the simple linear model, m is related
to c and p as given by Eq. (4). Quite generally, the slope
of h versus n, the parameter m, should provide a useful
yardstick to compare scientists of different seniority.

In the linear model, the minimum value of a in Eq.
(1) is a = 2, for the case c = p, where the papers with
more than h citations and those with fewer than h citations
contribute equally to the total $N_{c,tot}$. The value of a
will be larger for both c > p and c < p. For c > p,
most contributions to the total number of citations arise
from the 'highly cited papers' (the h papers that have
$N_c > h$), while for c < p it is the sparsely cited papers
(the $N_p - h$ papers that have fewer than h citations each)
that give the largest contribution to $N_{c,tot}$. We find that
the first situation holds in the vast majority, if not all,
cases. For the linear model defined in this example, a = 4
corresponds to c/p = 5.83 (the other value that yields
a = 4, c/p = 0.17, is unrealistic).
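
The two quoted values of c/p can be checked numerically. Combining Eq. (2) for large n with Eq. (4) gives a = (1 + c/p)^2 / (2 c/p) in the linear model, and the sketch below inverts that relation. The function names are hypothetical, and the closed form is a derivation from the linear model as stated here rather than an independent result.

```python
import math


def a_from_ratio(x: float) -> float:
    """Coefficient a of Eq. (1) in the linear model, with x = c/p."""
    return (1.0 + x) ** 2 / (2.0 * x)


def ratios_for_a(a: float):
    """Both values of c/p giving a chosen a >= 2.

    They are the roots of x**2 - 2*(a - 1)*x + 1 = 0, obtained by
    rearranging (1 + x)**2 = 2*a*x.
    """
    if a < 2.0:
        raise ValueError("the linear model requires a >= 2")
    root = math.sqrt(a * a - 2.0 * a)
    return (a - 1.0) - root, (a - 1.0) + root


low, high = ratios_for_a(4.0)
print(f"{low:.2f} {high:.2f}")  # 0.17 5.83
print(a_from_ratio(1.0))        # 2.0, the minimum, reached at c = p
```

The two roots are reciprocals of each other, which reflects the symmetry of a under exchanging c and p.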

The linear model defined above corresponds to the
distribution

$$N_c(y) = N_0 - \left(\frac{N_0}{h} - 1\right) y \qquad (7)$$

where $N_c(y)$ is the number of citations to the y-th paper
(ordered from most to least cited), and $N_0$ is the number
of citations of the most highly cited paper ($N_0 = cn$ in
the example above). The total number of papers $y_m$ is
given by $N_c(y_m) = 0$, hence

$$y_m = \frac{N_0\, h}{N_0 - h} \qquad (8)$$


We can write $N_0$ and $y_m$ in terms of a defined in Eq. (1)
as:

$$N_0 = h\left[a \pm \sqrt{a^2 - 2a}\right] \qquad (9a)$$

$$y_m = h\left[a \mp \sqrt{a^2 - 2a}\right] \qquad (9b)$$

For a = 2, $N_0 = y_m = 2h$. For larger a, the upper sign
in Eq. (9) corresponds to the case where the highly cited
papers dominate (more realistic case) and the lower sign
where the low-cited papers dominate the total citation
count.
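
A quick numerical check of Eqs. (7) to (9) is sketched below. It evaluates the endpoints of the linear distribution for a chosen h and a, then confirms that the triangle under the line has area a h^2 and that the line passes through the point (h, h). The function name and the values h = 50, a = 4 are illustrative assumptions.

```python
import math


def linear_model_endpoints(h: float, a: float, highly_cited_dominate: bool = True):
    """Return (N_0, y_m) of the linear distribution for given h and a >= 2."""
    if a < 2.0:
        raise ValueError("the linear model requires a >= 2")
    root = math.sqrt(a * a - 2.0 * a)
    if highly_cited_dominate:          # upper sign in Eq. (9)
        return h * (a + root), h * (a - root)
    return h * (a - root), h * (a + root)  # lower sign in Eq. (9)


h, a = 50.0, 4.0
n0, y_m = linear_model_endpoints(h, a)

area = 0.5 * n0 * y_m                    # total citations = area of the triangle
value_at_h = n0 - (n0 / h - 1.0) * h     # Eq. (7) evaluated at y = h
y_m_from_eq8 = n0 * h / (n0 - h)         # Eq. (8)

print(round(area), round(a * h ** 2))    # both 10000
print(round(value_at_h, 6))              # 50.0
print(abs(y_m - y_m_from_eq8) < 1e-9)    # True
```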

In a more realistic model, $N_c(y)$ will not be a linear
function of y. Note that a = 2 can safely be assumed to
be a lower bound quite generally, since a smaller value
of a would require the second derivative $d^2 N_c/dy^2$ to be
negative over large regions of y, which is not realistic. The
total number of citations is given by the area under the
$N_c(y)$ curve, which passes through the point $N_c(h) = h$.
In the linear model the lowest a = 2 corresponds to the
line of slope -1, as shown in Figure 1.

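A more realistic model would be a stretched exponential
of the form

$$N_c(y) = N_0\, e^{-(y/y_0)^{\beta}} \qquad (10)$$

Note that for $\beta \le 1$, $N_c''(y) > 0$ for all y, hence a > 2 is
true. We can write the distribution in terms of h and a
as

$$N_c(y) = \frac{a}{\alpha I(\beta)}\, h\, e^{-\left(\frac{y}{\alpha h}\right)^{\beta}} \qquad (11)$$

with $I(\beta)$ the integral

$$I(\beta) = \int_0^{\infty} dz\, e^{-z^{\beta}} \qquad (12)$$

and $\alpha$ determined by the equation

$$\alpha\, e^{\alpha^{-\beta}} = \frac{a}{I(\beta)} \qquad (13)$$

The maximally cited paper has citations

$$N_0 = \frac{a}{\alpha I(\beta)}\, h \qquad (14)$$

and the total number of papers (with at least one citation)
is determined by $N_c(y_m) = 1$ as

$$y_m = h\left[1 + \alpha^{\beta} \ln(h)\right]^{1/\beta} \qquad (15)$$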



A given researcher's distribution can be modeled by
choosing the most appropriate $\beta$ and a for that case. For
example, for $\beta = 1$, if a = 3, $\alpha = 0.661$ and $N_0 = 4.54h$,
$y_m = h[1 + 0.66 \ln(h)]$. With a = 4, $\alpha = 0.4644$, $N_0 = 8.61h$
and $y_m = h[1 + 0.46 \ln(h)]$. For $\beta = 0.5$, the lowest
possible value of a is 3.70; for that case, $N_0 = 7.4h$,
$y_m = h[1 + 0.5 \ln(h)]^2$. Larger a values will increase $N_0$
and reduce $y_m$. For $\beta = 2/3$, the smallest possible a is
a = 3.24, for which case $N_0 = 4.5h$ and
$y_m = h[1 + 0.67 \ln(h)]^{3/2}$.
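
The numerical values quoted above can be reproduced with a short sketch. It assumes the stretched-exponential form $N_c(y) = N_0 e^{-(y/\alpha h)^{\beta}}$ with $I(\beta) = \Gamma(1 + 1/\beta)$, solves $\alpha e^{\alpha^{-\beta}} = a/I(\beta)$ by bisection on the branch $\alpha^{\beta} \le \beta$ (the branch that matches the quoted $\alpha$ values), and uses the fact that the smallest a for a given $\beta$ occurs at $\alpha^{\beta} = \beta$. The function names and the choice of bisection are assumptions made for illustration, not the procedure used to obtain the values in the text.

```python
import math


def integral_I(beta: float) -> float:
    """I(beta) = integral from 0 to infinity of exp(-z**beta) dz."""
    return math.gamma(1.0 + 1.0 / beta)


def minimum_a(beta: float) -> float:
    """Smallest a reachable for a given beta, attained at alpha**beta = beta."""
    alpha = beta ** (1.0 / beta)
    return integral_I(beta) * alpha * math.exp(1.0 / beta)


def solve_alpha(a: float, beta: float, tol: float = 1e-12) -> float:
    """Solve alpha * exp(alpha**-beta) = a / I(beta) with alpha**beta <= beta."""
    target = math.log(a / integral_I(beta))

    def g(alpha: float) -> float:
        # Logarithm of the left side minus logarithm of the right side.
        return math.log(alpha) + alpha ** (-beta) - target

    hi = beta ** (1.0 / beta)      # g is decreasing on (0, hi]
    if g(hi) > 0.0:
        raise ValueError("a is below the minimum for this beta")
    lo = hi
    while g(lo) <= 0.0:            # walk left until the sign changes
        lo *= 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for a in (3.0, 4.0):
    alpha = solve_alpha(a, beta=1.0)
    n0_over_h = a / (alpha * integral_I(1.0))
    print(a, round(alpha, 3), round(n0_over_h, 2))  # alpha near 0.661 and 0.464

print(round(minimum_a(0.5), 2), round(minimum_a(2.0 / 3.0), 2))  # 3.69 3.24
```

Working with the logarithm of the equation avoids overflow in the exponential when the search interval reaches very small $\alpha$.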

The linear relation between h and n, Eq. (6), will of
course break down when the researcher slows down in paper
production or stops publishing altogether. There is a
time lag between the two events. In the linear model,
assuming the researcher stops publishing after $n_{stop}$ years,
h continues to increase at the same rate for a time

(16)



and then stays constant, since now all published papers
contribute to h. In a more realistic model h will smoothly
level off as n increases rather than with a discontinuous
change in slope. Still quite generally the time lag will be
larger for scientists who have published for many years,
as Eq. (16) indicates.
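
The growth and eventual plateau of h in the linear model can be illustrated with a small simulation. This is a sketch under stated assumptions: paper number k is taken to appear at time k/p, each paper gains c citations per year from publication onward, and the values p = 2, c = 10, and 20 years of publishing are hypothetical. Applying Eqs. (3a) and (3b) to the last published paper suggests that, in this setup, the plateau is reached about p times the publishing span divided by c years after publication stops (4 years here); that estimate is derived for this sketch and is not quoted from the text.

```python
def h_index(citations):
    """h for a list of citation counts (integers or floats)."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def linear_model_h(n, p, c, n_stop=None):
    """h after n years with p papers per year, each earning c citations per year.

    If n_stop is given, no papers are published after year n_stop, but the
    existing papers keep collecting citations.
    """
    last_year = n if n_stop is None else min(n, n_stop)
    n_papers = int(p * last_year)
    # Paper k (k = 1, 2, ...) is assumed to appear at time k / p.
    citations = [c * (n - k / p) for k in range(1, n_papers + 1)]
    return h_index(citations)


p, c = 2, 10
print(linear_model_h(20, p, c), c * 20 / (1 + c / p))  # 33 and about 33.3, cf. Eq. (4)

# Publication stops after 20 years; h keeps growing, then levels off.
for n in (20, 22, 24, 30):
    print(n, linear_model_h(n, p, c, n_stop=20))        # 33, 36, 40, 40
```

The final value, 40, equals the total number of papers published, consistent with the statement that h stays constant once all published papers contribute.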

Furthermore, in reality of course not all papers will
eventually contribute to h. Some papers with low citations
will never contribute to a researcher's h, especially
if written late in the career when h is already appreciable.
As discussed by Redner [3], most papers earn their citations
over a limited period of popularity and then they
are no longer cited. Hence it will be the case that papers
that contributed to a researcher's h early in his/her
career will no longer contribute to h later in the individual's
career. Nevertheless it is of course always true that
h cannot decrease with time. The paper or papers that
at any given time have exactly h citations are at risk of
being eliminated from the individual's h-count as they
are superseded by other papers that are being cited at a
higher rate. It is also possible that papers 'drop out' and
then later come back into the h-count, as would occur for
the kind of papers termed 'sleeping beauties'.

For the individual researchers mentioned earlier I find
n from the time elapsed since their first published paper
till the present, and find the following values for the
slope m defined in Eq. (6): Witten, m = 3.89; Heeger,
m = 2.38; Cohen, m = 2.24; Gossard, m = 2.09; Anderson,
m = 1.88; Weinberg, m = 1.76; Fisher, m = 1.91;
Cardona, m = 1.87; deGennes, m = 1.75; Bahcall,
m = 1.75; Fisk, m = 2.14; Scalapino, m = 1.88; Parisi,
m = 2.15; Louie, m = 2.33; Jackiw, m = 1.92; Wilczek,
m = 2.19; Vafa, m = 3.30; Maple, m = 1.94; Gross,
m = 1.69; Dresselhaus, m = 1.41; Hawking, m = 1.59.
From inspection of the citation records of many physicists
I conclude:

(1) A value m ~ 1, i.e. an h-index of 20 after 20 years
of scientific activity, characterizes a successful scientist.

(2) A value m ~ 2, i.e. an h-index of 40 after 20 years
of scientific activity, characterizes outstanding scientists,
likely to be found only at the top universities or major
research laboratories.

(3) A value m ~ 3 or higher, i.e. an h-index of 60 after
20 years, or 90 after 30 years, characterizes truly unique
individuals.

The m-parameter ceases to be useful if a scientist does
not maintain his/her level of productivity, while the
h-parameter remains useful as a measure of cumulative
achievement that may continue to increase over time even
long after the scientist has stopped publishing altogether.

Based on typical h and m values found, I suggest that
(with large error bars) for faculty at major research
universities h ~ 10 to 12 might be a typical value for
advancement to tenure (associate professor), and h ~ 18 for
advancement to full professor. Fellowship in the American
Physical Society might occur typically for h ~ 15
to 20. Membership in the US National Academy of Sciences
may typically be associated with h ~ 45 and higher
except in exceptional circumstances. Note that these
estimates correspond roughly to the typical number of years of
sustained research production assuming an m ~ 1 value;
the time scales of course will be shorter for scientists with
higher m values. Note that the time estimates are taken
from the publication of the first paper, which typically
occurs some years before the Ph.D. is earned.

There are however a number of caveats that should be 
kept in mind. Obviously a single number can never give 
more than a rough approximation to an individual's mul- 
tifaceted profile, and many other factors should be con- 
sidered in combination in evaluating an individual. This 
and the fact that there can always be exceptions to rules 
should be kept in mind especially in life-changing decisions
such as the granting or denying of tenure. There
will be differences in typical h-values in different fields,
determined in part by the average number of references
in a paper in the field, the average number of papers
produced by each scientist in the field, and also by the
size (number of scientists) of the field (although to a first
approximation in a larger field there are more scientists
to share a larger number of citations, so typical h-values
should not necessarily be larger). Scientists working in
non-mainstream areas will not achieve the same very high 
h values as the top echelon of those working in highly 
topical areas. While I argue that a high h is a reliable
indicator of high accomplishment, the converse is
not necessarily always true. There is considerable variation
in the skewness of citation distributions even within
a given subfield, and for an author with relatively low h
that has a few seminal papers with extraordinarily high
citation counts, the h-index will not fully reflect that
scientist's accomplishments. Conversely, a scientist with a
high h achieved mostly through papers with many coauthors
would be treated overly kindly by his/her h. Subfields
with typically large collaborations (e.g. high energy
experiment) will typically exhibit larger h-values, and I
suggest that in cases of large differences in the number
of coauthors it may be useful in comparing different
individuals to normalize h by a factor that reflects the
average number of coauthors. For determining the scientific
'age' in the computation of m, the very first paper may 
sometimes not be the appropriate starting point if it rep- 
resents a relatively minor early contribution well before 
sustained productivity ensued. 

Finally, in any measure of citations ideally one would 
like to eliminate the self-citations. While self-citations 
can obviously increase a scientist's h, their effect on h 
is much smaller than on the total citation count. First, 
all self-citations to papers with less than h citations are 
irrelevant, as are the self-citations to papers with many 
more than h citations. To correct h for self-citations one 
would consider the papers with number of citations just 
above h, and count the number of self-citations in each. If 
a paper with h+n citations has more than n self-citations,
it would be dropped from the h-count, and h would drop
by 1. Usually this procedure would involve only very few
if any papers. As the other face of this coin, scientists
intent on increasing their h-index by self-citations would
naturally target those papers with citations just below h.
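
One way to carry out this correction is sketched below: h is simply recomputed on citation counts net of self-citations, which gives the same result as inspecting the papers just above h one by one. The data layout (pairs of total and self-citation counts), the function names, and the sample numbers are hypothetical.

```python
def h_index(citations):
    """h for a list of citation counts."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def h_without_self_citations(papers):
    """papers: iterable of (total_citations, self_citations) pairs."""
    net = [total - own for total, own in papers]
    return h_index(net)


# Hypothetical record. The fourth paper has h + 1 = 5 citations, 2 of them
# self-citations, so it falls out of the h-count once they are removed.
papers = [(30, 2), (12, 1), (6, 0), (5, 2), (4, 1), (2, 0)]

print(h_index([total for total, _ in papers]))  # 4
print(h_without_self_citations(papers))         # 3
```

In this example the self-citations to the two most cited papers change nothing, in line with the remark that self-citations to papers far above or below h are irrelevant.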

As an interesting sample population I computed h and
m for the physicists that obtained Nobel prizes in the
last 20 years (for calculating m I used the later of the
first published paper year or 1955, the first year in the
ISI database). However the set was further restricted
by including only the names that uniquely identified the
scientist in the ISI citation index. This restricted our
set to 76% of the total; it is however still an unbiased
estimator since the commonality of the name should be
uncorrelated with h and m. h-indices range from 22 to
79, m-indices from 0.47 to 2.19. Averages and standard
deviations are $\langle h \rangle = 41$, $\sigma_h = 15$, and
$\langle m \rangle = 1.14$, $\sigma_m = 0.47$. The distribution of
h-indices is shown in Figure 2; the median is at $h_m = 35$,
lower than the mean due to the tail for high h values. It is
interesting that Nobel prize winners have substantial h
indices (84% had h of at least 30), indicating that Nobel
prizes do not originate in one stroke of luck but in a body
of scientific work. Notably the values of m found are often
not high compared to other successful scientists (49% of our
sample had m < 1). This is clearly because Nobel prizes are
often awarded long after the period of maximum productivity
of the researchers.

FIG. 2: Histogram giving number of Nobel-prize recipients
in Physics in the last 20 years versus their h-index. The peak
is at h-index between 35 and 39.

As another example, among newly elected members
in the National Academy of Sciences in Physics and
Astronomy in 2005 I find $\langle h \rangle = 44$, $\sigma_h = 14$, highest
h = 71, lowest h = 20, median $h_m = 46$. Among the
total membership in NAS in Physics the subgroup of last
names starting with A and B has $\langle h \rangle = 38$, $\sigma_h = 10$,
$h_m = 37$. These examples further indicate that the
index h is a stable and consistent estimator of scientific
achievement.

An intriguing idea is the extension of the h-index concept
to groups of individuals. The SPIRES high energy
physics literature database recently implemented
the h-index in their citation summaries, and it also
allows the computation of h for groups of scientists. The
overall h-index of a group will generally be larger than
that of each of the members of the group but smaller
than the sum of the individual h-indices, since some of
the papers that contribute to each individual's h will no
longer contribute to the group's h. For example, the overall
h-index of the condensed matter group at the UCSD
physics department is h = 118, of which the largest
individual contribution is 25; the highest individual h is
66, and the sum of individual h's is above 300. The
contribution of each individual to the group's h is not
necessarily proportional to the individual's h, and the
highest contributor to the group's h will not necessarily
be the individual with highest h. In fact, in principle
(although rarely in practice) the lowest-h individual in a
group could be the largest contributor to the group's h.
For a prospective graduate student considering different
graduate programs, a ranking of groups or departments
in his/her chosen area according to their overall h-index
would likely be of interest, and for administrators
concerned with these issues the ranking of their departments
or entire institution according to the overall h could also
be of interest.
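
A group h-index can be sketched as follows, assuming it is computed over the union of the members' papers with each co-authored paper counted once. That assumption, the data layout (author mapped to paper identifiers and citation counts), and the sample numbers are hypothetical and are not taken from the SPIRES implementation.

```python
def h_index(citations):
    """h for a collection of citation counts."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def group_h_index(members):
    """members: dict mapping author -> dict of paper_id -> citation count."""
    merged = {}
    for papers in members.values():
        merged.update(papers)   # a paper shared by two members is kept once
    return h_index(merged.values())


# Hypothetical two-person group; paper "p3" is co-authored by both members.
group = {
    "author_A": {"p1": 10, "p2": 8, "p3": 3},
    "author_B": {"p3": 3, "p4": 7, "p5": 6, "p6": 1},
}

individual = {name: h_index(papers.values()) for name, papers in group.items()}
print(individual)            # {'author_A': 3, 'author_B': 3}
print(group_h_index(group))  # 4
```

Here the group value (4) exceeds each individual value (3) but stays below their sum (6), the pattern described above.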

To conclude, I discuss some observations in the fields
of biological and biomedical sciences. From the list
compiled by Christopher King of Thomson ISI of the most
highly cited scientists in the period 1983-2002 [8], I found
the h-indices for the top 10 on that list, all in the
life sciences, which are, in order of decreasing h: S.H.
Snyder, h = 191; D. Baltimore, h = 160; R.C. Gallo,
h = 154; P. Chambon, h = 153; B. Vogelstein, h = 151;
S. Moncada, h = 143; C.A. Dinarello, h = 138; T.
Kishimoto, h = 134; R. Evans, h = 127; A. Ullrich,
h = 120. It can be seen that not surprisingly all these
highly cited researchers also have high h-indices, and
that high h-indices in the life sciences are much higher
than in physics. Among 36 new inductees in the National
Academy of Sciences in biological and biomedical sciences
in 2005 I find $\langle h \rangle = 57$, $\sigma_h = 22$, highest h = 135,
lowest h = 18, median $h_m = 57$. These latter results
confirm that h-indices in biological sciences tend to be
higher than in physics; however, they also indicate that
the difference appears to be much higher at the high end
than on average. Clearly more research in understanding
similarities and differences of h-index distributions
in different fields of science would be of interest.

In summary, I have proposed an easily computable index,
h, which gives an estimate of the importance, significance
and broad impact of a scientist's cumulative
research contributions. I suggest that this index may
provide a useful yardstick to compare different individuals
competing for the same resource when an important
evaluation criterion is scientific achievement, in an
unbiased way.